Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Authors
Abstract
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs video representations at frame-clip-video granularities to capture fine-grained video content, and text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between the two modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, with Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
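The hierarchical contrastive learning described above can be illustrated with a minimal sketch: a symmetric InfoNCE-style loss is computed on a batch similarity matrix at each granularity level (video-sentence, clip-phrase, frame-word), and the per-level losses are summed. This is not the authors' implementation; the function names, the temperature value, and the equal level weights are illustrative assumptions.

```python
import numpy as np

def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a (B, B) similarity matrix,
    where sim[i, j] is the similarity between text i and video j and
    the diagonal holds the matched pairs. Temperature is an assumed value."""
    logits = sim / temperature
    # text -> video direction: softmax over each row, matched pair on the diagonal
    exp_r = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_t2v = exp_r / exp_r.sum(axis=1, keepdims=True)
    loss_t2v = -np.log(np.diag(p_t2v)).mean()
    # video -> text direction: softmax over each column (rows of the transpose)
    exp_c = np.exp(logits.T - logits.T.max(axis=1, keepdims=True))
    p_v2t = exp_c / exp_c.sum(axis=1, keepdims=True)
    loss_v2t = -np.log(np.diag(p_v2t)).mean()
    return 0.5 * (loss_t2v + loss_v2t)

def hierarchical_loss(sims_by_level, weights=None):
    """Sum contrastive losses over granularity levels, e.g. keys
    'video-sentence', 'clip-phrase', 'frame-word' (equal weights assumed)."""
    weights = weights or {k: 1.0 for k in sims_by_level}
    return sum(weights[k] * info_nce(s) for k, s in sims_by_level.items())
```

A well-aligned batch (high similarity on the diagonal) yields a lower loss at every level than a misaligned one, which is the signal that drives the multi-level training.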
Similar Resources
Topic-Centered Multi-Level Representations for Text Retrieval
Motivation: The amount of information widely available in electronic form is growing at an enormous rate. It is generally accepted that this holds great promise for applications as diverse as basic research, news, entertainment, and on-line social communities. Generally useful techniques for sifting through this mostly unstructured stuff are in great demand, as can be seen by the proliferation ...
News Video Retrieval using Multi-modal Query-dependent Model and Parallel Text Corpus
This paper describes a fully automated news video retrieval system that is capable of retrieving relevant shots using a multimedia query. The emphasis we adopted is three-fold. First, we explore the use of multi-modal features such as speaker identification, video OCR, face recognition and named entities in ASR text, along with pseudo relevance feedback, for video retrieval. Second, we employ query...
Improving Cross-Language Text Retrieval with Human Interactions
Can we expect people to be able to get information from texts in languages they cannot read? In this paper we review two relevant lines of research bearing on this question and will show how our results are being used in the design of a new Web interface for cross-language text retrieval. One line of research, “Interactive IR”, is concerned with the user interface issues for information retriev...
Cross-modal Retrieval by Text and Image Feature Biclustering
We describe our approach to the ImageCLEF-Photo 2007 task. The novelty of our method consists of biclustering image segments and annotation words. Given the query words, we may select the image segment clusters that have strongest cooccurrence with the corresponding word clusters. These image segment clusters act as the selected segments relevant to a query. We rank text hits by our own tf.idf ...
Cross-modal Embeddings for Video and Audio Retrieval
The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural net...
Journal
Journal title: IEEE Access
Year: 2022
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2022.3227973